This markdown document answers the questions shared by the DiDi hiring manager. All of them are answered using SQL and R scripts.
A localhost database was created in order to show the query results. Because of that, simply re-running the script on another computer won't reproduce them.
I'll present each question and follow it with the SQL needed to answer it.
Write the SQL queries necessary to generate a list of the five restaurants that have the highest average number of visitors on holidays. The result table should also contain that average per restaurant.
IMPORTANT: I didn’t consider days without operation.
```sql
SELECT id as restaurant_id, avg(visitors) as avg_daily_visitors_on_holiday
FROM
  (SELECT rv.id, visit_date, sum(rv.reserve_visitors) as visitors
   FROM restaurant_visitors rv
   LEFT JOIN date_info di on rv.visit_date = di.calendar_date
   WHERE di.holiday_flg = 1
   GROUP BY id, visit_date) hrv
GROUP BY id
ORDER BY avg_daily_visitors_on_holiday desc
LIMIT 5
```
| restaurant_id | avg_daily_visitors_on_holiday |
|---|---|
| 0a74a5408a0b8642 | 33.27273 |
| e053c561f32acc28 | 30.63636 |
| db80363d35f10926 | 25.20000 |
| 42c9aa6d617c5057 | 18.85714 |
| 36429b5ca4407b3e | 18.44444 |
The top restaurant has an average of 33 daily customers on holidays:
top_five_rest_in_holiday.barplot
But in reality, the 4th and 5th places could swap if we used the median instead of the mean.
top_five_rest_in_holiday.boxplot
It's important to view the full distribution whenever a comparison of means is requested.
top_five_rest_in_holiday.densityplot
Use SQL to discover which day of the week there are usually more visitors on average in restaurants.
IMPORTANT: I didn’t consider days without operation.
```sql
SELECT
  day_of_week,
  avg(visitors) as avg_visitors_by_weekday
FROM
  (SELECT
     rv.visit_date,
     di.day_of_week,
     sum(reserve_visitors) as visitors
   FROM restaurant_visitors rv
   LEFT JOIN date_info di on rv.visit_date = di.calendar_date
   GROUP BY visit_date, day_of_week) vrw
GROUP BY day_of_week
ORDER BY avg_visitors_by_weekday DESC
```
| day_of_week | avg_visitors_by_weekday |
|---|---|
| Saturday | 193.85246 |
| Friday | 181.08197 |
| Thursday | 114.50847 |
| Wednesday | 92.26230 |
| Sunday | 79.96667 |
| Monday | 79.69492 |
| Tuesday | 76.98246 |
Saturday is the weekday with the highest average number of customers (bar plot). But the median (blue points) is almost the same on Friday and Saturday.
weekday_w_more_visitors.barplot
Saturday's distribution tends to be higher, yet its median is almost the same as Friday's. The distribution is skewed to the right, which is why there's a big difference between the median and the mean.
weekday_w_more_visitors.boxplot
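The gap between mean and median under right skew is easy to reproduce with a toy example (the numbers here are illustrative, not from the dataset):

```python
# A right-skewed toy sample: one large value drags the mean above the median.
data = sorted([10, 11, 12, 12, 13, 14, 80])
median = data[len(data) // 2]
mean = sum(data) / len(data)
print(median, round(mean, 2))  # 12 21.71
```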
```sql
SELECT
  day_of_week,
  avg(visitors) as avg_visitors_by_store_by_weekday
FROM
  (SELECT
     id,
     visit_date,
     day_of_week,
     sum(reserve_visitors) as visitors
   FROM restaurant_visitors rv
   LEFT JOIN date_info di on rv.visit_date = di.calendar_date
   GROUP BY id, visit_date, day_of_week) vrw
GROUP BY day_of_week
ORDER BY avg_visitors_by_store_by_weekday DESC
```
| day_of_week | avg_visitors_by_store_by_weekday |
|---|---|
| Saturday | 17.06349 |
| Friday | 15.93939 |
| Thursday | 12.17297 |
| Wednesday | 12.02564 |
| Tuesday | 11.73262 |
| Monday | 10.56629 |
| Sunday | 10.54505 |
On average, Saturday is the weekday with the most visitors per restaurant (17). But again, the distributions of Friday and Saturday are not that different: the median number of visitors per restaurant on both days is 12. There are multiple outliers on all weekdays.
weekday_w_more_visitors_by_rest.barplot
weekday_w_more_visitors_by_rest.boxplot
How was the percentage of growth of the amount of visitors’ week over week for the last four weeks of the data? You can solve this question using SQL or any other tool that you prefer. If you use other tools, please add your code or files.
```sql
WITH
last_date as (
  SELECT DATE_SUB(max(visit_date), INTERVAL 4 WEEK) as cutoff_date
  FROM restaurant_visitors
),
weekly_visitors as (
  SELECT
    DATE_SUB(visit_date, INTERVAL DAYOFWEEK(visit_date)-2 DAY) as visit_week,
    sum(reserve_visitors) as visitors
  FROM restaurant_visitors rv
  LEFT JOIN date_info di on rv.visit_date = di.calendar_date
  GROUP BY visit_week
),
weekly_growth as (
  SELECT
    visit_week,
    visitors,
    visitors / lag(visitors, 1) OVER (ORDER BY visit_week) - 1 as weekly_visitors_growth
  FROM weekly_visitors
)
SELECT *
FROM weekly_growth
WHERE visit_week >= (SELECT cutoff_date FROM last_date)
ORDER BY visit_week
```
| visit_week | visitors | weekly_visitors_growth |
|---|---|---|
| 2017-05-08 | 175 | 0.0670732 |
| 2017-05-15 | 78 | -0.5542857 |
| 2017-05-22 | 63 | -0.1923077 |
| 2017-05-29 | 15 | -0.7619048 |
There has been a continuous decrease in visitors over the last four weeks.
visitors_weekly_growth_percentage.lineplot
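The lag-based growth calculation can also be sketched outside SQL. A minimal Python version using the weekly totals from the table above (the 2017-05-01 total of 164 is back-computed from the 6.7% growth shown for 2017-05-08):

```python
def wow_growth(weekly_totals):
    """Return (week, visitors, growth) rows; growth = visitors / lag(visitors) - 1."""
    rows, prev = [], None
    for week, visitors in weekly_totals:
        growth = None if prev is None else visitors / prev - 1
        rows.append((week, visitors, growth))
        prev = visitors
    return rows

# Last five weekly totals; the first row only seeds the lag.
weeks = [("2017-05-01", 164), ("2017-05-08", 175), ("2017-05-15", 78),
         ("2017-05-22", 63), ("2017-05-29", 15)]
for week, v, g in wow_growth(weeks)[1:]:
    print(week, v, round(g, 7))  # reproduces the growth column above
```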
There’s a gap in 2016, from the weeks of Jul 25 to Sep 12. And it appears that the number of visitors drastically decreased from the week of Apr 17, 2017 onward.
weekly_visitors.lineplot
monthly_visitors.lineplot
daily_visitors.lineplot
The number of visitors drastically decreased last month; it appears to have returned to the 2016 average. But the number of unique restaurants didn't decrease as much.
monthly_restaurants.lineplot
The average number of visitors per restaurant decreased in May 2017, but in general it is around 161 visitors.
monthly_avg_visitors_by_rest.lineplot
Forecast for the next six months, after the last date of the data, the sum of visitors of all the restaurants and validate the accuracy of your forecast.
I created two forecasts and challenged them against each other; the model with the lowest MAE was used to forecast the next 6 months.
monthly_visitors_fcst.lineplot
The model shows Jun being forecasted with the same bouncing effect that Jun 2016 had. Paying close attention to the forecast and updating it when required is important so the process improves month over month.
I'll show the results for the three time series:

- Average visitors per restaurant
- Number of restaurants
- Total visitors
Given the change in the distribution of the time series and the shortage of data, I created forecasts using ARIMA, Mean, Naive, Seasonal Naive, and TBATS, and selected the best one by comparing each model's MAE on a training and test set.
Applying auto.arima, p and q are 0. Average visitors per restaurant is purely white noise (d = 0). Restaurants is a random walk (d = 1). Total visitors is, again, a random walk (d = 1). Since average visitors is white noise, its point forecast will be the same as the mean method's; and because restaurants and total visitors are random walks, their point forecasts will be the same as a naive model's. Reference: https://people.duke.edu/~rnau/411arim.htm
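These equivalences are easy to check numerically. A minimal sketch with made-up numbers (not from the dataset), showing that the mean and naive methods produce exactly the ARIMA(0,0,0) and ARIMA(0,1,0) point forecasts described above:

```python
def mean_forecast(series, h):
    """White noise (ARIMA(0,0,0)): the point forecast is the training mean."""
    return [sum(series) / len(series)] * h

def naive_forecast(series, h):
    """Random walk (ARIMA(0,1,0)): the point forecast is the last observed value."""
    return [series[-1]] * h

def mae(actual, forecast):
    """Mean absolute error between a test set and a forecast."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

train, test = [150, 170, 158, 162], [180, 140]
print(mean_forecast(train, 2))   # [160.0, 160.0]
print(naive_forecast(train, 2))  # [162, 162]
print(mae(test, naive_forecast(train, 2)))  # 20.0
```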
arima_models[[1]]
## Series: subseries
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 159.1783
## s.e. 30.3838
##
## sigma^2 estimated as 13001: log likelihood=-79.5
## AIC=163 AICc=164.2 BIC=164.13
arima_model_result[[1]][,'MAE']
## Training set Test set
## 78.20526 78.40482
The MAE is really high relative to the series mean (about 78 against a mean of about 159).
arima_models[[2]]
## Series: subseries
## ARIMA(0,1,0)
##
## sigma^2 estimated as 68: log likelihood=-42.34
## AIC=86.69 AICc=87.09 BIC=87.17
arima_model_result[[2]][,'MAE']
## Training set Test set
## 3.846538 3.000000
The train and test set MAEs are close here.
arima_models[[3]]
## Series: subseries
## ARIMA(0,1,0)
##
## sigma^2 estimated as 4784777: log likelihood=-109.31
## AIC=220.63 AICc=221.03 BIC=221.11
arima_model_result[[3]][,'MAE']
## Training set Test set
## 1255.147 2137.250
Residuals for average visitors and restaurants are white noise, so I can rely on those results.
avg_models[[1]]
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Feb 2 159.1783 -1.299244 319.6558 -98.63475 416.9914
## Mar 2 159.1783 -1.299244 319.6558 -98.63475 416.9914
## Apr 2 159.1783 -1.299244 319.6558 -98.63475 416.9914
## May 2 159.1783 -1.299244 319.6558 -98.63475 416.9914
## Jun 2 159.1783 -1.299244 319.6558 -98.63475 416.9914
## Jul 2 159.1783 -1.299244 319.6558 -98.63475 416.9914
## Aug 2 159.1783 -1.299244 319.6558 -98.63475 416.9914
## Sep 2 159.1783 -1.299244 319.6558 -98.63475 416.9914
## Oct 2 159.1783 -1.299244 319.6558 -98.63475 416.9914
## Nov 2 159.1783 -1.299244 319.6558 -98.63475 416.9914
avg_model_result[[1]][,'MAE']
## Training set Test set
## 78.20526 78.40482
avg_models[[2]]
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Feb 2 12.15385 -5.062737 29.37043 -15.50523 39.81292
## Mar 2 12.15385 -5.062737 29.37043 -15.50523 39.81292
## Apr 2 12.15385 -5.062737 29.37043 -15.50523 39.81292
## May 2 12.15385 -5.062737 29.37043 -15.50523 39.81292
## Jun 2 12.15385 -5.062737 29.37043 -15.50523 39.81292
## Jul 2 12.15385 -5.062737 29.37043 -15.50523 39.81292
## Aug 2 12.15385 -5.062737 29.37043 -15.50523 39.81292
## Sep 2 12.15385 -5.062737 29.37043 -15.50523 39.81292
## Oct 2 12.15385 -5.062737 29.37043 -15.50523 39.81292
## Nov 2 12.15385 -5.062737 29.37043 -15.50523 39.81292
avg_model_result[[2]][,'MAE']
## Training set Test set
## 10.67456 14.84615
avg_models[[3]]
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Feb 2 2248.769 -1789.273 6286.811 -4238.493 8736.032
## Mar 2 2248.769 -1789.273 6286.811 -4238.493 8736.032
## Apr 2 2248.769 -1789.273 6286.811 -4238.493 8736.032
## May 2 2248.769 -1789.273 6286.811 -4238.493 8736.032
## Jun 2 2248.769 -1789.273 6286.811 -4238.493 8736.032
## Jul 2 2248.769 -1789.273 6286.811 -4238.493 8736.032
## Aug 2 2248.769 -1789.273 6286.811 -4238.493 8736.032
## Sep 2 2248.769 -1789.273 6286.811 -4238.493 8736.032
## Oct 2 2248.769 -1789.273 6286.811 -4238.493 8736.032
## Nov 2 2248.769 -1789.273 6286.811 -4238.493 8736.032
avg_model_result[[3]][,'MAE']
## Training set Test set
## 2130.107 3626.865
Residuals for average visitors and restaurants are not white noise, which means there is something underlying that could be explained with more analysis. But I'll keep the results if they succeed in the challenge.
naive_models[[1]]
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Feb 2 206.8276 31.51611 382.1391 -61.28818 474.9434
## Mar 2 206.8276 -41.10028 454.7555 -172.34537 586.0005
## Apr 2 206.8276 -96.82080 510.4760 -257.56255 671.2177
## May 2 206.8276 -143.79537 557.4505 -329.40395 743.0591
## Jun 2 206.8276 -185.18080 598.8360 -392.69750 806.3527
## Jul 2 206.8276 -222.59608 636.2513 -449.91924 863.5744
## Aug 2 206.8276 -257.00299 670.6582 -502.54006 916.1952
## Sep 2 206.8276 -289.02815 702.6833 -551.51833 965.1735
## Oct 2 206.8276 -319.10685 732.7620 -597.51972 1011.1749
## Nov 2 206.8276 -347.55599 761.2112 -641.02892 1054.6841
naive_model_result[[1]][,'MAE']
## Training set Test set
## 102.85999 69.98649
naive_models[[2]]
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Feb 2 29 18.4320551 39.56794 12.837723 45.16228
## Mar 2 29 14.0546689 43.94533 6.143089 51.85691
## Apr 2 29 10.6957824 47.30422 1.006115 56.99389
## May 2 29 7.8641101 50.13589 -3.324554 61.32455
## Jun 2 29 5.3693567 52.63064 -7.139950 65.13995
## Jul 2 29 3.1139273 54.88607 -10.589332 68.58933
## Aug 2 29 1.0398458 56.96015 -13.761366 71.76137
## Sep 2 29 -0.8906621 58.89066 -16.713823 74.71382
## Oct 2 29 -2.7038348 60.70383 -19.486831 77.48683
## Nov 2 29 -4.4187762 62.41878 -22.109608 80.10961
naive_model_result[[2]][,'MAE']
## Training set Test set
## 4.166667 3.000000
naive_models[[3]]
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Feb 2 5998 3194.7170 8801.283 1710.74861 10285.25
## Mar 2 5998 2033.5592 9962.441 -65.08906 12061.09
## Apr 2 5998 1142.5714 10853.429 -1427.73723 13423.74
## May 2 5998 391.4340 11604.566 -2576.50277 14572.50
## Jun 2 5998 -270.3314 12266.331 -3588.58554 15584.59
## Jul 2 5998 -868.6130 12864.613 -4503.57830 16499.58
## Aug 2 5998 -1418.7897 13414.790 -5345.00098 17341.00
## Sep 2 5998 -1930.8817 13926.882 -6128.17811 18124.18
## Oct 2 5998 -2411.8490 14407.849 -6863.75416 18859.75
## Nov 2 5998 -2866.7592 14862.759 -7559.47928 19555.48
naive_model_result[[3]][,'MAE']
## Training set Test set
## 1359.667 2137.250
Seasonal Naive is tricky to use when only one season has passed. Nonetheless, here are the results.
snaive_models[[1]]
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Feb 2 144.66667 73.12074 216.21260 35.246608 254.0867
## Mar 2 186.71429 115.16836 258.26022 77.294227 296.1343
## Apr 2 191.42857 119.88264 262.97450 82.008513 300.8486
## May 2 119.00000 47.45407 190.54593 9.579942 228.4201
## Jun 2 401.66667 330.12074 473.21260 292.246608 511.0867
## Jul 2 160.00000 88.45407 231.54593 50.579942 269.4201
## Aug 2 0.00000 -71.54593 71.54593 -109.420058 109.4201
## Sep 2 1.00000 -70.54593 72.54593 -108.420058 110.4201
## Oct 2 31.06897 -40.47696 102.61490 -78.351093 140.4890
## Nov 2 160.30000 88.75407 231.84593 50.879942 269.7201
## Dec 2 315.64516 244.09923 387.19109 206.225103 425.0652
## Jan 3 206.82759 135.28166 278.37352 97.407528 316.2476
## Feb 3 144.66667 43.48544 245.84789 -10.076664 299.4100
## Mar 3 186.71429 85.53306 287.89551 31.970955 341.4576
## Apr 3 191.42857 90.24735 292.60980 36.685241 346.1719
## May 3 119.00000 17.81877 220.18123 -35.743330 273.7433
## Jun 3 401.66667 300.48544 502.84789 246.923336 556.4100
## Jul 3 160.00000 58.81877 261.18123 5.256670 314.7433
## Aug 3 0.00000 -101.18123 101.18123 -154.743330 154.7433
## Sep 3 1.00000 -100.18123 102.18123 -153.743330 155.7433
## Oct 3 31.06897 -70.11226 132.25019 -123.674365 185.8123
## Nov 3 160.30000 59.11877 261.48123 5.556670 315.0433
## Dec 3 315.64516 214.46394 416.82639 160.901831 470.3885
## Jan 4 206.82759 105.64636 308.00881 52.084256 361.5709
snaive_model_result[[1]][,'MAE']
## Training set Test set
## 55.82759 62.61794
snaive_models[[2]]
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Feb 2 5 -25.7572376 35.75724 -42.03914 52.03914
## Mar 2 6 -24.7572376 36.75724 -41.03914 53.03914
## Apr 2 7 -23.7572376 37.75724 -40.03914 54.03914
## May 2 6 -24.7572376 36.75724 -41.03914 53.03914
## Jun 2 3 -27.7572376 33.75724 -44.03914 50.03914
## Jul 2 7 -23.7572376 37.75724 -40.03914 54.03914
## Aug 2 0 -30.7572376 30.75724 -47.03914 47.03914
## Sep 2 1 -29.7572376 31.75724 -46.03914 48.03914
## Oct 2 28 -2.7572376 58.75724 -19.03914 75.03914
## Nov 2 30 -0.7572376 60.75724 -17.03914 77.03914
## Dec 2 31 0.2427624 61.75724 -16.03914 78.03914
## Jan 3 29 -1.7572376 59.75724 -18.03914 76.03914
## Feb 3 5 -38.4973025 48.49730 -61.52338 71.52338
## Mar 3 6 -37.4973025 49.49730 -60.52338 72.52338
## Apr 3 7 -36.4973025 50.49730 -59.52338 73.52338
## May 3 6 -37.4973025 49.49730 -60.52338 72.52338
## Jun 3 3 -40.4973025 46.49730 -63.52338 69.52338
## Jul 3 7 -36.4973025 50.49730 -59.52338 73.52338
## Aug 3 0 -43.4973025 43.49730 -66.52338 66.52338
## Sep 3 1 -42.4973025 44.49730 -65.52338 67.52338
## Oct 3 28 -15.4973025 71.49730 -38.52338 94.52338
## Nov 3 30 -13.4973025 73.49730 -36.52338 96.52338
## Dec 3 31 -12.4973025 74.49730 -35.52338 97.52338
## Jan 4 29 -14.4973025 72.49730 -37.52338 95.52338
snaive_model_result[[2]][,'MAE']
## Training set Test set
## 24 21
snaive_models[[3]]
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Feb 2 868 -5657.6606 7393.661 -9112.1366 10848.137
## Mar 2 1307 -5218.6606 7832.661 -8673.1366 11287.137
## Apr 2 1340 -5185.6606 7865.661 -8640.1366 11320.137
## May 2 833 -5692.6606 7358.661 -9147.1366 10813.137
## Jun 2 1205 -5320.6606 7730.661 -8775.1366 11185.137
## Jul 2 1280 -5245.6606 7805.661 -8700.1366 11260.137
## Aug 2 0 -6525.6606 6525.661 -9980.1366 9980.137
## Sep 2 2 -6523.6606 6527.661 -9978.1366 9982.137
## Oct 2 901 -5624.6606 7426.661 -9079.1366 10881.137
## Nov 2 4809 -1716.6606 11334.661 -5171.1366 14789.137
## Dec 2 9785 3259.3394 16310.661 -195.1366 19765.137
## Jan 3 5998 -527.6606 12523.661 -3982.1366 15978.137
## Feb 3 868 -8360.6777 10096.678 -13246.0445 14982.045
## Mar 3 1307 -7921.6777 10535.678 -12807.0445 15421.045
## Apr 3 1340 -7888.6777 10568.678 -12774.0445 15454.045
## May 3 833 -8395.6777 10061.678 -13281.0445 14947.045
## Jun 3 1205 -8023.6777 10433.678 -12909.0445 15319.045
## Jul 3 1280 -7948.6777 10508.678 -12834.0445 15394.045
## Aug 3 0 -9228.6777 9228.678 -14114.0445 14114.045
## Sep 3 2 -9226.6777 9230.678 -14112.0445 14116.045
## Oct 3 901 -8327.6777 10129.678 -13213.0445 15015.045
## Nov 3 4809 -4419.6777 14037.678 -9305.0445 18923.045
## Dec 3 9785 556.3223 19013.678 -4329.0445 23899.045
## Jan 4 5998 -3230.6777 15226.678 -8116.0445 20112.045
snaive_model_result[[3]][,'MAE']
## Training set Test set
## 5092.00 4080.75
TBATS combines Trigonometric seasonality, Box-Cox transformation, ARMA errors, Trend, and Seasonal components.
This methodology tends to perform well when there is more than one distribution regime in a time series, something present in all three of ours.
tbats_models[[1]]
## TBATS(1, {1,0}, 1, {<12,5>})
##
## Call: tbats(y = subseries)
##
## Parameters
## Alpha: -0.02011887
## Beta: 0.0001004308
## Damping Parameter: 1
## Gamma-1 Values: 1.445945e-05
## Gamma-2 Values: 0.0002511197
## AR coefficients: -0.983529
##
## Seed States:
## [,1]
## [1,] 124.21924
## [2,] 4.75464
## [3,] 28.59260
## [4,] 50.28093
## [5,] -36.73593
## [6,] -19.23905
## [7,] 11.54502
## [8,] 84.15764
## [9,] -87.81545
## [10,] 16.37373
## [11,] -66.74894
## [12,] 26.35279
## [13,] 0.00000
##
## Sigma: 5.515416
## AIC: 113.7406
tbats_model_result[[1]][,'MAE']
## Training set Test set
## 4.058811 62.658474
tbats_models[[2]]
## TBATS(1, {1,0}, 1, {<12,5>})
##
## Call: tbats(y = subseries)
##
## Parameters
## Alpha: -0.0201189
## Beta: 0.000100343
## Damping Parameter: 1
## Gamma-1 Values: 1.445892e-05
## Gamma-2 Values: 0.0002511196
## AR coefficients: -0.983523
##
## Seed States:
## [,1]
## [1,] -2.2688061
## [2,] 2.0025719
## [3,] 8.7976634
## [4,] -2.2178185
## [5,] -3.2182471
## [6,] 2.9482767
## [7,] -0.7315858
## [8,] -0.6949788
## [9,] -4.9074773
## [10,] 1.6666667
## [11,] 1.1547005
## [12,] -2.1383545
## [13,] 0.0000000
##
## Sigma: 0.1386042
## AIC: 17.96488
tbats_model_result[[2]][,'MAE']
## Training set Test set
## 0.1019991 3.6632624
tbats_models[[3]]
## TBATS(1, {1,0}, 1, {<12,5>})
##
## Call: tbats(y = subseries)
##
## Parameters
## Alpha: -0.02011887
## Beta: 0.0001004308
## Damping Parameter: 1
## Gamma-1 Values: 1.445945e-05
## Gamma-2 Values: 0.0002511197
## AR coefficients: -0.983529
##
## Seed States:
## [,1]
## [1,] -828.32414
## [2,] 425.22100
## [3,] 2148.93838
## [4,] 806.42866
## [5,] -535.96929
## [6,] -423.10193
## [7,] -579.48928
## [8,] 628.60792
## [9,] -1351.43264
## [10,] -934.16667
## [11,] -590.58121
## [12,] -70.27458
## [13,] 0.00000
##
## Sigma: 47.83824
## AIC: 169.9078
tbats_model_result[[3]][,'MAE']
## Training set Test set
## 35.20431 2137.10158
After the previous analysis, we discovered that:

- TBATS had the best performance for average monthly visitors. Even though it overfit, it had the lowest MAE on the test set. Future analysis would be required to maintain the forecast.
- The naive forecast had a good result on unique restaurants. It's a simple method, but for an erratic series it is a good and easy approach to forecasting.
Then I evaluated which approach is more accurate: I built an ensemble by multiplying the average-visitors forecast by the unique-restaurants forecast and compared it against the direct total-visitors forecast, selecting the method with the lowest MAE on the test set.
mae_ttl_visitors
## total_visitors_ensemble_mae total_visitors_tbats_mae
## 1986.521 2137.102
The ensemble model is more accurate, with a mean absolute error of 1,986: on average, each month's forecast will be off by about ±1,986 total visitors.
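The ensemble comparison itself is a small computation. A sketch with hypothetical forecast values (the real inputs are the TBATS average-visitors forecast and the naive restaurant-count forecast described above):

```python
def mae(actual, forecast):
    """Mean absolute error between a test set and a forecast."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

# Hypothetical test-set values and forecasts, for illustration only.
actual_total    = [5000, 5200, 4100]
avg_visitors_fc = [12.0, 12.5, 11.0]  # e.g. avg visitors per restaurant
restaurants_fc  = [400, 400, 400]     # e.g. unique restaurants
total_fc        = [5998, 5998, 5998]  # direct forecast of total visitors

# Ensemble: element-wise product of the two component forecasts.
ensemble_fc = [a * r for a, r in zip(avg_visitors_fc, restaurants_fc)]
print(round(mae(actual_total, ensemble_fc), 2))  # 233.33
print(round(mae(actual_total, total_fc), 2))     # 1231.33
```

Whichever forecast yields the lower test-set MAE wins the challenge.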
Once the method was selected, I updated the naive and TBATS models to include the latest data, and forecasted from Jun to Nov.
Based on the data and your ideas, plan strategies to double the total restaurant visitors in six months.
Strategies that, combined, could double the number of visitors:
restaurants_position
Below is a Japan population density plot
visitors_by_genre.barplot
days_from_reserve_to_visit.boxplot
Note: You can zoom in to where the boxplots are located; the outliers force the boxplot graph to zoom out.
Imagine that these restaurants are in your city (and not in Japan). What other data would you want to join in order to get more insights to increase visitors?
This is crucial to understand what drives the user to reserve or purchase anything. Age, consumption habits, etc.
Maybe there’s a high traffic of requests to visit a place but something in the funnel makes the user drop the reservation.
There could be the opportunity to understand demand elasticity given different prices. I’ve stopped placing orders in some restaurants because the price increased a lot.
How many active users and new users do we have for each week of November 2019 to February 2020?
Definitions:

- New user: count of unique user ids at each user's minimum order_info.order_date.
- Active user: count of unique users with an order, grouped by order_info.order_date.
```sql
WITH
new_users as (
  SELECT
    user_id,
    min(order_date) as order_date
  FROM order_info oi
  GROUP BY user_id
),
weekly_new_users as (
  SELECT
    DATE_SUB(order_date,
             INTERVAL DAYOFWEEK(order_date)-2 DAY) as order_week,
    count(distinct user_id) as new_users
  FROM new_users
  GROUP BY order_week
),
weekly_unique_active_users as (
  SELECT
    DATE_SUB(order_date,
             INTERVAL DAYOFWEEK(order_date)-2 DAY) as order_week,
    count(distinct user_id) as active_users
  FROM order_info oi
  GROUP BY order_week
)
SELECT *
FROM weekly_unique_active_users wau
LEFT JOIN weekly_new_users wnu USING (order_week)
WHERE order_week between '2019-11-01' and '2020-02-29'
ORDER BY order_week
```
| order_week | active_users | new_users |
|---|---|---|
| 2019-11-04 | 7210 | 7009 |
| 2019-11-11 | 8651 | 7544 |
| 2019-11-18 | 10002 | 8233 |
| 2019-11-25 | 10905 | 8676 |
| 2019-12-02 | 12140 | 9458 |
| 2019-12-09 | 12665 | 9161 |
| 2019-12-16 | 11056 | 6730 |
| 2019-12-23 | 8207 | 4638 |
| 2019-12-30 | 8318 | 4730 |
| 2020-01-06 | 9774 | 5614 |
For question 4
new_and_active_weekly_users.lineplot
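The new/active definitions can be sketched in plain Python with toy orders (Monday-start weeks; note the SQL's DAYOFWEEK trick pushes Sunday orders into the following week, while Python's `weekday()` keeps them with the preceding Monday):

```python
from datetime import date, timedelta
from collections import defaultdict

def week_start(d):
    """Monday of d's week (Monday-start bucketing, assumed here)."""
    return d - timedelta(days=d.weekday())

orders = [  # toy (user_id, order_date) pairs, not real data
    (1, date(2019, 11, 4)), (1, date(2019, 11, 12)),
    (2, date(2019, 11, 5)), (3, date(2019, 11, 13)),
]

# New user: counted in the week of their first-ever order.
first_order = {}
for uid, d in orders:
    if uid not in first_order or d < first_order[uid]:
        first_order[uid] = d

# Active user: counted in every week where they placed at least one order.
active, new = defaultdict(set), defaultdict(set)
for uid, d in orders:
    active[week_start(d)].add(uid)
for uid, d in first_order.items():
    new[week_start(d)].add(uid)

for wk in sorted(active):
    print(wk, len(active[wk]), len(new[wk]))
# 2019-11-04 2 2
# 2019-11-11 2 1
```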
How many reengaged users do we have (Reengaged: active this week that didn’t have an order last week but they did before that) for each week of November 2019 to February 2020?
Reengaged-user metric definition: count of unique user ids where the number of weeks between the current order week and the previous one is greater than 1.
```sql
WITH
user_by_week AS (
  SELECT
    *,
    lag(order_week, 1) OVER (PARTITION BY user_id ORDER BY order_week) previous_order_week
  FROM
    (SELECT
       user_id,
       DATE_SUB(order_date,
                INTERVAL DAYOFWEEK(order_date)-2 DAY) as order_week
     FROM order_info oi
     GROUP BY user_id, order_week) upw
),
reengaged_users AS (
  SELECT *, floor(DATEDIFF(order_week, previous_order_week)/7) as inactive_weeks
  FROM user_by_week ubw
  WHERE floor(DATEDIFF(order_week, previous_order_week)/7) > 1
)
SELECT order_week, count(distinct user_id) as reengaged_users
FROM reengaged_users ru
WHERE order_week between '2019-11-01' and '2020-02-29'
GROUP BY order_week
ORDER BY order_week
```
| order_week | reengaged_users |
|---|---|
| 2019-11-11 | 54 |
| 2019-11-18 | 546 |
| 2019-11-25 | 821 |
| 2019-12-02 | 1254 |
| 2019-12-09 | 1704 |
| 2019-12-16 | 2241 |
| 2019-12-23 | 1955 |
| 2019-12-30 | 2207 |
| 2020-01-06 | 2727 |
| 2020-01-13 | 2633 |
For question 4
weekly_reengaged_users.lineplot
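The reengagement rule (a per-user lag over active weeks, flagging gaps longer than one week) can be sketched with toy data:

```python
from datetime import date, timedelta
from collections import defaultdict

def week_start(d):
    """Monday of d's week (Monday-start bucketing, assumed here)."""
    return d - timedelta(days=d.weekday())

def reengaged_by_week(orders):
    """Map week -> users reengaged that week: active again after a gap of more
    than one week since their previous active week (mirrors the lag() query)."""
    weeks_by_user = defaultdict(set)
    for uid, d in orders:
        weeks_by_user[uid].add(week_start(d))
    reengaged = defaultdict(set)
    for uid, weeks in weeks_by_user.items():
        ordered = sorted(weeks)
        for prev, cur in zip(ordered, ordered[1:]):
            if (cur - prev).days // 7 > 1:  # inactive_weeks > 1
                reengaged[cur].add(uid)
    return reengaged

orders = [(1, date(2019, 11, 4)), (1, date(2019, 11, 25)),  # 3-week gap: reengaged
          (2, date(2019, 11, 4)), (2, date(2019, 11, 11))]  # consecutive weeks: not
print(sorted(reengaged_by_week(orders)[date(2019, 11, 25)]))  # [1]
```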
What’s the average GMV by type of user (Active, new, reengaged) for each week of November 2019 to February 2020?
IMPORTANT: The definitions of the three user types are not mutually exclusive; new users and reengaged users are, in fact, also active users.
```sql
WITH
user_by_week AS (
  SELECT
    *,
    lag(order_week, 1) OVER (PARTITION BY user_id ORDER BY order_week) previous_order_week
  FROM
    (SELECT
       user_id,
       DATE_SUB(order_date,
                INTERVAL DAYOFWEEK(order_date)-2 DAY) as order_week
     FROM order_info oi
     GROUP BY user_id, order_week) upw
),
reengaged_users AS (
  SELECT *, floor(DATEDIFF(order_week, previous_order_week)/7) as inactive_weeks
  FROM user_by_week ubw
  WHERE floor(DATEDIFF(order_week, previous_order_week)/7) > 1
),
weekly_reengaged_users AS (
  SELECT order_week, user_id
  FROM reengaged_users ru
  GROUP BY order_week, user_id
),
reengaged_users_orders AS (
  SELECT
    wru.order_week,
    avg(gmv) as avg_gmv
  FROM weekly_reengaged_users wru
  LEFT JOIN order_info oi ON
    (wru.user_id = oi.user_id AND
     DATE_SUB(oi.order_date,
              INTERVAL DAYOFWEEK(oi.order_date)-2 DAY) = wru.order_week)
  GROUP BY wru.order_week
),
new_users_orders AS (
  SELECT
    DATE_SUB(oi.order_date,
             INTERVAL DAYOFWEEK(oi.order_date)-2 DAY) as order_week,
    avg(gmv) as avg_gmv
  FROM user_info ui
  LEFT JOIN order_info oi ON
    (ui.first_order = oi.order_id AND oi.user_id = ui.user_id)
  GROUP BY order_week
),
active_users_orders as (
  SELECT
    DATE_SUB(order_date,
             INTERVAL DAYOFWEEK(order_date)-2 DAY) as order_week,
    avg(gmv) as avg_gmv
  FROM order_info oi
  GROUP BY order_week
)
SELECT *, 'New User' AS user_type
FROM new_users_orders nuo
WHERE nuo.order_week between '2019-11-01' and '2020-02-29'
UNION
SELECT *, 'Active User' AS user_type
FROM active_users_orders auo
WHERE auo.order_week between '2019-11-01' and '2020-02-29'
UNION
SELECT *, 'Reengaged User' AS user_type
FROM reengaged_users_orders ruo
WHERE ruo.order_week between '2019-11-01' and '2020-02-29'
```
| order_week | avg_gmv | user_type |
|---|---|---|
| 2019-11-18 | 161.1856 | New User |
| 2019-11-25 | 158.0916 | New User |
| 2020-01-06 | 164.6383 | New User |
| 2019-12-16 | 170.8322 | New User |
| 2020-01-20 | 152.4472 | New User |
| 2020-02-24 | 159.7361 | New User |
| 2019-11-04 | 139.1320 | New User |
| 2020-02-10 | 166.2084 | New User |
| 2020-02-17 | 164.1437 | New User |
| 2019-12-09 | 166.3906 | New User |
For question 4
avg_gmv_per_user_type.lineplot
On your preferred tool (Excel, Python, R, etc.) please create charts for each of your results.
The charts are included with each question above.
Based on the charts give your opinion/recommendations regarding to the different type of users.
It's important to have order_date as a timestamp. It allows you to understand the number of orders before churning and the level of engagement users have with your application.
avg_gmv_by_order_n.lineplot
Order number definition: the cumulative number of orders per user_id.
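The cumulative order number can be sketched with toy data (user ids and dates here are made up for illustration):

```python
from collections import defaultdict
from datetime import date

def order_numbers(orders):
    """Assign each (user_id, order_date) its user's cumulative order count, in date order."""
    counts = defaultdict(int)
    out = []
    for uid, d in sorted(orders, key=lambda t: t[1]):
        counts[uid] += 1
        out.append((uid, d, counts[uid]))
    return out

toy = [(1, date(2020, 1, 5)), (2, date(2020, 1, 6)), (1, date(2020, 1, 9))]
print(order_numbers(toy))
# [(1, datetime.date(2020, 1, 5), 1), (2, datetime.date(2020, 1, 6), 1), (1, datetime.date(2020, 1, 9), 2)]
```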
inactive_users.sankeyplot
days_to_recur.boxplot